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We consider the limitations of two techniques for detecting nonlinearity in 
time series. The first technique compares the original time series to an ensemble 
of surrogate time series that are constructed to mimic the linear properties of 
the original. The second technique compares the forecasting error of linear and 
nonlinear predictors. Both techniques are found to be problematic when the data 
has a long coherence time; they tend to indicate nonlinearity even for linear time 
series. We investigate the causes of these difficulties both analytically and with 
numerical experiments on "real" and computer-generated data. 

In particular, although we do see some initial evidence for nonlinear structure 
in the SFI dataset E, we are inclined to dismiss this evidence as an artifact of 
the long coherence time. 
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"May you have interesting data. " 
— ancient Chinese curse 

1 Introduction 

For time series that arise from chaotic systems, there are certain quantities (e.g., the fractal 
dimension of the strange attractor, or the spectrum of Lyapunov exponents) that are es- 
pecially interesting because they characterize intuitively useful concepts (number of active 
degrees of freedom, or rate of divergence of nearby trajectories) and are invariant to smooth 
coordinate changes. Algorithms for estimating these quantities are available, but they are 
notoriously unreliable, and often rely heavily on the skill and judgement of their opera- 
tors. There is an embarrassing lack of consensus, even among the so-called experts, on what 
constitutes a good estimate of dimension or Lyapunov exponent, or even whether chaos is 
present in a give time series. To some extent this difficulty may be attributed to inadequate 
comparison of one algorithm to another (and this conference is aimed at addressing that 
inadequacy), but to some extent, it is just a hard problem. 

The problem is arguably hard enough for long noise-free data sets generated on a com- 
puter from low-dimensional maps or differential equations. For "real" data, as the speakers 
at this conference have repeatedly emphasized, the problem is far more difficult. (And as 
the organizers have repeatedly reminded us, far more valuable.) Real data is contaminated 
with noise (which is rarely additive, Gaussian, or white), is measured with finite precision, 
and is subject to innumerable external influences in the environment and the measurement 
apparatus. And of course there is never enough of it. 

In this article, we will describe (yet) another source of difficulty that arises in the anal- 
ysis of time series data. The particular problem of detecting nonlinear structure — either 
by comparison of the data to linear surrogate data, or by comparing linear and nonlinear 
predictors — is seen to be complicated when the data exhibits long coherence times. 

In this section we define some terms and discuss linear modeling of time series. Section 2 
describes the method of surrogate data, and compares two approaches to generating surrogate 
data. We find that both have difficulties trying to mimic data with long coherence time. We 
illustrate these problems with real and computer-generated time series in Section 3, including 
the time series E.dat from the the SFI competition. In the last section, we discuss what it 
is about the analysis or the data that is problematic. 

1.1 Terminology 

A time series is a sequence of measurements X\,X2, ■ ■ ■ of some physical system taken 
at regular intervals of time. A time series can be thought of as a particular realization of a 
stochastic process, which we will define as a sequence of random 1 variables . . . , A_i, Xq, X\, 
X2, . . .. We make this distinction because theorems and formal definitions are available only 
for processes, while the whole purpose of generating this formalism is to assist researchers 
who are confronted with real experimental time series. 

x Note that even a deterministic process is usefully defined as a sequence of random variables. For the 
logistic process, X t +\ = 4X t (l — X t ), for instance, each variable X t has a nontrivial probability distribution 
P(X t ), but the joint distribution P(X t +\, X t ) reflects the deterministic law. 
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We will also distinguish the terms system and model, by letting system refer to the 
actual underlying physics, 2 and model to a hypothetical description of the system. Since 
the model will (for our present purposes) be inferred only from the time series, we cannot 
expect the model to be expressed in terms of the physics. But, although the model is 
really nothing more than an operational description of the time series, the hope is that this 
description — in conjunction with knowledge of the appropriate physics — will actually 
say something useful about the underlying physical system. When we talk about a best or 
"correct" (always in quotation marks!) model, we will mean the model which — out of a 
(usually parametric) family of models — has the least root mean squared (rms) error in its 
one-step- ahead forecast. 3 

Three statistics of particular interest are the mean \x = (X t ), the variance a 2 = ((X t — 
yu) 2 ), and the autocorrelation function A(t) = ((X t — /j>)(X t _ T — yu))/cr 2 . Here (•) represents, 
for the process, an ensemble average. If the process is ergodic, the average could also be over 
time t, and in that case, good sample statistics can be defined from a single time series. 

When we speak of "coherence time," what we mean is the time beyond which a signal 
becomes uncorrelated with its past. We can formalize the concept somewhat by defining 
coherence time as that time r such that the absolute value of the autocorrelation function 
|^4(T)| is smaller than some pre-specified value e for all T > r. This is to be distinguished 
from the first time T that the autocorrelation A(T) drops to a value below e; in other 
words, we are interested in the "envelope" of the autocorrelation curve. This is not really 
satisfactory as a formal definition of coherence time — for one thing, it depends on the choice 
of e — but it is adequate for our current purposes. 

In general, however, if the autocorrelation A(T) vanishes exponentially fast as T — ► oo, we 
will say that the coherence time is finite; if it does not vanish at all (if lim supy^^ |>1(T)| > 0), 
then we say that the coherence time is infinite. 4 

1.2 Wold decomposition 

The Wold decomposition is the fundamental theorem of linear time series analysis (e.g., see 
Ref. [2, §7.6.3]). This theorem states that any stationary zero-mean process (linear or nonlin- 
ear) can be decomposed into the sum of two uncorrelated components: one "deterministic" 
and one "indeterministic." That is 

x t = z t + u t (1) 

where the linearly deterministic z t can be modeled exactly with a (possibly infinite) linear 
combination of past values, and where the indeterministic u t can be modeled by a moving 

2 Mathematically, the system is equivalent to the process, but the connotation we mean to imply for a 
system is that it is physical. 

3 This is a convenient but basically arbitrary criterion. As Tsay [56] emphasizes, "it is well known that 
the best model with respect to one checking criterion may fare badly with respect to another criterion." 
One definition that is consistent with these constraints is r = 

Et=i V 1 ~ E2 (T), where E(T) is the 
rms forecasting error T time steps into the future for the best linear model, normalized by the standard 
deviation of the data. We won't actually be using this definition (the informal description in the text will 
be adequate), but it does seem appropriate to at least write such a definition down. 
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average of uncorrelated innovations. 5 

oo 

z t = ^2 a i z t-i (2) 

i=l 
oo 

U t = J2fr e t~i ( 3 ) 



i=0 



It can be shown that the autocorrelation A Z (T) = {z t z t -T) / {z 2 ) of the deterministic 
component of the time series will be significantly nonzero for arbitrarily large T, whereas 
the autocorrelation A U (T) of the indeterministic time series will approach zero as T becomes 
large. Since u and z are uncorrelated (again, not necessarily independent, but satisfying 
(u t z t ) = 0), it follows that the autocorrelation in the full time series is 

A m _ MT)(u 2 ) + MT)(z*) 
x[ } ~ (u*) + (z*) • [ ) 

From the point of view of the Wold decomposition, then, a process has a finite (resp. infinite) 
coherence time if and only if its linearly deterministic component is zero (resp. nonzero). 

1.3 Linear modeling of time series (ARMA) 

It follows from the Wold decomposition theorem that any stationary process can be modeled 
as an autoregressive moving- average: 

oo oo 

Y, b i e t-i- ( 5 ) 

i=l i=0 

For instance, given and (3i from Eqs. (2-3), one can take a; = and bi = fa — 12)=i 
However, this is not necessarily a unique solution. For indeterministic time series, for in- 
stance, it is possible to write the time series as a pure auto-regressive (AR) 



x t = x + ^diXt-i + ae t , (6) 
or as a pure moving- average (MA) 

oo 

x t = x + bi^t-i- (7) 

For time series with infinite coherence time (nonzero linearly deterministic component), 
however, a full ARMA model is typically required. 

In the study of linear Gaussian processes, the innovations are taken to be independent 
and identically distributed (IID) Gaussian random variables. For indeterministic time series, 



5 These innovations are uncorrelated, or "white," but they are not necessarily independent. This means 
(e t e t <) = for t t' , but not that the joint distribution P(e t ,e t <) is equal to the product of the marginal 
distributions P(e t )P(e t /). The innovations are treated as "noise" in linear analysis, but they may well possess 
nonlinear deterministic structure. The Wold decomposition is quite general, and applies to all stationary 
processes, including low-dimensional chaos. 
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which can be written as a pure moving average of the Gaussian innovations, this implies 
that the time series itself will be Gaussian. However, if a deterministic component is present 
(again, that means an infinite coherence time) then Gaussian innovations do not necessarily 
imply Gaussian data. For example a sine wave with added Gaussian white noise can be 
modeled as a linear process with Gaussian innovations but the time series is not Gaussian. 

Lii and Rosenblatt [31] have discussed linear (indeterministic) processes with non-Gaussian 
innovation; they show that these processes are far more complicated than those with Gaus- 
sian innovations. 

2 Surrogate data 

Surrogate data is artificially generated data which is to be used in place of an original data 
set; the main purpose is to provide a kind of baseline or control against which the original 
data can be compared. In tests for chaos, for example, one can control against artifacts due 
to autocorrelation in a time series by generating surrogate data from a random process that 
mimics the autocorrelation of the original time series. Suppose some algorithm indicates 
low-dimensional chaos in a time series. If the same algorithm also indicates low-dimensional 
chaos in the surrogate time series, then one can dismiss the original evidence for chaos as an 
artifact of the autocorrelation. 

More formally, the method provides a mechanism for testing well formulated null hy- 
potheses. It can be difficult to precisely formulate interesting null hypotheses, and often 
very difficult to prescribe a surrogate data generator which is appropriate for such a null 
hypothesis. Our work has focused on tests for nonlinearity which take linearly correlated 
Gaussian noise as the null hypothesis. In this case, one is not looking for chaos per se, but 
for some statistic which is significantly different for the original time series than it is for the 
linear surrogates. The existence of such a statistic implies that the original time series is 
inconsistent with the null hypothesis, and therefore that the original time series is nonlinear. 

While the systematic application of this approach to tests of potentially chaotic time 
series has only recently become fashionable, the basic idea is by no means new. Monte- 
Carlo methods for generating data sets with specified properties are widely used, and in 
some applications have reached the status of recipes [38, §14.5]. Statisticians have long 
advocated resampling (so-called "bootstrap") methods, in which new data sets are generated 
by randomizing the original data set in some prescribed way. We have found the writing 
of Efron in particular to be enlightening and inspirational [10, 11]. The purpose of these 
methods, however, is usually not to test a hypothesis, but to estimate confidence intervals 
for some statistic of interest. 

The application of these resampling methods to time series is complicated by the tem- 
poral dependence of time series data; most of the original bootstrap applications considered 
individual data points to be independent events. An indirect approach is to remove the 
linear dependence in the data by considering the innovations (the "residual" time series) of 
an ARMA model [11], though the filtering required to produce the residuals can make it 
harder to "see" the nonlinearity in a chaotic time series [50]. Direct resampling techniques 
based on temporal "blocks" of data were discussed by Kunsch [26], and an improvement was 
developed by Politis and Romano [30, 37]. While further exploration is certainly called for, it 
is not clear to us that these methods (at least as they have been applied in Refs. [26, 30, 37]) 
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can be used in conjunction with dynamical statistics for the purpose of hypothesis testing. 

Parametric bootstraps, instead of resampling the data directly, use the data to set pa- 
rameter values, and then use these values in a parametric model for generating new data. 
An incomplete list of authors who have successfully used this approach include: Grass- 
berger [15], who used a simple linear autoregressive process to generate a time series which 
mimicked properties of a climate data set originally purported to exhibit low-dimensional 
chaos; Kurths and Herzel [27], who compared estimates of dimension and Lyapunov expo- 
nent for a time series of solar radio pulsations with those for data from an AR(5) model 
that fairly accurately matched the spectral properties of the original data; Brock et oi. [5], 
who generated surrogate financial time series to test trading strategies; Ellner [12], who used 
this approach to show that a variety of nonchaotic "plausible alternatives" might adequately 
explain measles and chickenpox data; and Tsay [56], who provides an excellent overview of 
the approach with a wide variety of applications. 

Kaplan and Cohen [23] published the first example we are aware of in which the evidence 
for chaos in a time series was evaluated by comparing against a control data set that was 
generated by the Fourier transform (FT) method which is described in Section 2.1. Somewhat 
earlier, Osborne et oi. [35,36] inverted l/f a spectra using an inverse Fourier transform to 
generate realizations of 1/ f a noise, and then showed that dimension estimates of these time 
series were problematic. (This issue has been further discussed in Refs. [39,49].) The use of 
multiple surrogate data sets for more formal statistical hypothesis testing was suggested in 
Refs. [47, 48] and implemented in Refs. [51, 52] for a variety of examples. Smith has applied 
the surrogate data methods to fluid dynamical time series [45], and more recently, to address 
the issue of inherent periodicities in the climate record. 6 A variant of the surrogate data 
approach has also been described in Ref. [25]. 

The use of formal statistics, in which the null hypothesis is explicitly spelled out and 
carefully tested against, is only lately gaining popularity in the chaos community. Brock, 
Dechert, and Scheinkman [4] deserve to be singled out for creating perhaps the first sta- 
tistically rigorous application of the Grassberger-Procaccia [16] correlation integral for time 
series analysis. This work has led to a veritable industry in the economics community involv- 
ing the application of statistics which incorporate the explicit recognition of chaos [3, 6, 7, 
19, 20, 28, 29, 43]; these complement the more classical approaches taken by the statisticians 
[18, 24, 33, 46, 54-56]. Many of these are reviewed in Tong's comprehensive book [53]. 

2.1 FT-based surrogates 

To test for nonlinearity, we begin with the pre-supposition that the time series is linear. A 
more precise formulation of the null hypothesis is that the data arise from a linear stochastic 
process with Gaussian innovations. 7 

The algorithm we generally use for making linear surrogate data is based on the Fourier 
transform (FT). Specifically, we compute a discrete Fourier transform of the original data, 
and replace the phases at each frequency with random numbers in the interval [0, 2n) while 
keeping the magnitude at each frequency [i.e., the power spectrum) intact 8 , and then apply 

6 L. A. Smith (personal communication). 
An extended null hypothesis which considers that there is an underlying process that is Gaussian, but 
one is observing a static nonlinear transform of that process, is discussed in Ref. [51]. 

8 It is important that the phases be symmetrized in such a way that the inverse Fourier transform is real 



Theiler, Linsay, and Rubin, p. 7 



the inverse Fourier transform to produce the surrogate time series. 

This is a kind of nonparametric bootstrap which by construction produces surrogates that 
have the same power spectrum as the original data. In fact, the surrogate time series have 
exactly the same sample power spectrum as the original time series. The Wiener-Khintchine 
relations assure us that two processes with the same power spectrum will also have the same 
autocorrelation function, but in comparing the sample statistics, we have to be more careful. 
Jenkins and Watts [21] note that there are (at least) three different ways to define a sample 
autocorrelation (In these definitions, the time series is for convenience assumed to have zero 
mean): 

• Unbiased estimator: j^y^ J2tL^ T %t x t+T- 

• Biased estimator (lower variance than unbiased estimator): Z)£l T X t x t+T- 

• Circular autocorrelation: (E^I"i T X t x t+T + E^w-t+i x t x t+T _ N ^ . 

The estimators agree to order 0(T/N), and for T<iV and iV — > oo all three approach 
the actual autocorrelation of the process. But for finite iV they are only approximately equal. 
And of the three, it is the circular autocorrelation that is exactly preserved in going from 
the original to the surrogate data sets. 

For a Gaussian linear process, all of its properties are encoded in the mean, variance, 
and autocorrelation. But when we say that the "linear" properties of the time series are 
preserved in the surrogate time series, what that means exactly is that the sample mean, 
sample variance, and circular autocorrelation are preserved. 

2.2 ARM A model-based surrogates 

Instead of attempting to exactly preserve some preselected set of sample statistics, an al- 
ternative approach for generating surrogate data is to directly fit the data to a constructive 
parametric linear model, such as a finite-order ARMA(p, q): 

p q 

xt = x + ^2aiX t -i + ^2bie t _i. (8) 

Constructing a parametric model from a finite set of data involves choosing the "correct" 
values for q and p, and this is an issue of some subtlety; one wants enough terms to capture 
the correct correlations in the data but not so many terms that the data is over- fit. Aikake [1] 
and Schwarz [44] have suggested fairly general criteria; a more recent discussion specific to the 
ARMA model can be found in Ref. [40]. For fixed values of q and p, the optimal parameters 
(«;, bi) depend in principle only on the autocorrelation of the stochastic process. 

If there is no deterministic component (z t = in Eq. (1)), then an ARMA(p, q) process 
can be modeled by a pure autoregressive AR model or a pure moving- average MA model, 
of appropriately large order. 9 We note that in practice it is much easier to fit coefficients 

and the power at each frequency is unaffected; we remark that the recipe for doing this in Ref. [52, §A.1,#4] 
is incorrect. We are indebted to W. Schaffer for pointing out this error. 

9 But in general a pure AR or pure MA will be less parsimonious than the best ARMA(p, q) model; that 
is, the AR or MA models will usually require more than p + q parameters. Having said this, we should 
further note that the ARMA formalism doesn't necessarily generate the most parsimonious description of 
linear Gaussian processes either. 
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to a pure AR model than to an MA or ARMA model. In that case, assuming a zero-mean 
process for convenience, the formula is given by [2, p. 187]: 



These are sometimes called the Yule- Walker equations. 

Having determined the appropriate ARMA model, one can generate surrogate data by 
inserting Gaussian IID random numbers into the e t terms, and then iterating Eq. (8). One 
is assured that in the long run, the autocorrelation in the surrogate data will approach the 
autocorrelations used in Eq. (9), but note that this is different from the exact match of 
sample statistics that is seen in the FT surrogates. 

A common alternative practice is is to bootstrap the residuals themselves. Having fit the 
model to the data, one derives a time series of residuals e t which are then scrambled and 
re-inserted into Eq. (8). This avoids the assumption of Gaussian innovations, and therefore 
leads to a broader class of time series, and presumably tests against a looser null hypothesis; 
however, linear processes with non-Gaussian innovations do not always behave in "linear" 
ways — for instance, see Tong [53, pp. 13-14], Lii and Rosenblatt [31], or Kanter [22] for 
examples of some of the pathologies. 

It is also worth noting that this AR model is also the optimal linear predictor; that is 
the average squared errors 



are minimized when the coefficients chosen according to Eq. (9). 

2.3 Comparison of FT and ARMA surrogates 

The superficial equivalence of FT and ARMA modeling rests with the notion that both the 
Fourier spectrum and the AR coefficients depend only on the autocorrelation function of the 
original time series, which is (at least approximately) mimicked by the surrogates in both 
cases. 

The difference between FT and ARMA surrogates is basically the difference between 
"fitting the data" versus "fitting the model." FT surrogates exactly match certain sample 
statistics (mean, variance, and circular autocorrelation) of the original data. ARMA sur- 
rogates are generated from a model that is fit to the original time series. These surrogates 
exhibit sample statistics that are usually but not necessarily in approximate agreement with 
those of the original time series. 

Another difference is the way the data sets are generated. The FT method makes a 
whole new time series all at once, and it necessarily has the same length as the original time 
series. The ARMA method generates new points iteratively, one at a time, and can generate 
arbitrarily long or short data sets. This is not necessarily an advantage, though. Generating 
points sequentially, one is vulnerable to instabilities that may amplify small errors into large 
effects in the long term. This is a general difficulty with model-based surrogate data methods; 




(9) 




2 



(10) 



Theiler, Linsay, and Rubin, p. 9 



two models which are approximately equal (say, have nearly equal ARMA coefficients) can 
give rise to time series that are markedly different. 10 A second difficulty that arises when 
modeling processes with long coherence times is that the qualitative long term behavior, 
even the overall amplitude of the process, can depend not only on the model parameters but 
on initial conditions as well. 

While the FT method is nonparametric, in the sense that one does not directly fit a model 
to the data, one can think of it rather as having a very large number of parameters, iV/2, 
corresponding to the amplitude of the power spectrum at N/2 frequencies. As a model, 
then, the FT provides an extreme overfit to the data. By contrast, ARMA models are 
parsimonious, in that the modeler is (usually) careful to choose the minimum number of 
parameters needed to fit the data. 11 

Which approach is preferable depends on the application. Our view is that FT surrogates 
are better for testing hypotheses, while ARMA surrogates may be better for estimating 
confidence intervals. Certainly the FT surrogates will be useless for estimating confidence 
intervals for estimates of mean, variance, and autocorrelation. 

3 Application to time series data 

In this section, we will investigate four different time series. The first is a real time series 
that was part of the SFI competition. Though we seem to see evidence for nonlinearity in 
this time series, we give reasons to suspect the results. The second data set is an artificially 
generated sine wave plus noise. This data is meant to be a caricature of the real data, but 
a caricature whose underlying process is known. With this second data set we are able to 
see the same effects that we observed with the "real" data set, and thereby confirm our 
suspicions that the effects we saw were artifacts of the long coherence time. The third data 
set is also, strictly speaking, a sine wave plus noise, but it is a particularly simple example 
that permits some analytical discussion. For the third data set, we compare the theoretical 
efficiency of linear versus nonlinear predictors. Finally the last data set is a sum of two 
commensurate sine waves with some added noise; in this case, we see numerically we we 
described in theory for the third data set: namely, that the prediction error of a nonlinear 
model fit to the data is smaller than the error of a linear model fit to the data. 

3.1 The investigation of E.dat 

We apply tests for nonlinearity based on the method of surrogate data to the SFI data 
set E.dat. These data are observations of the light curve of a variable white dwarf star, 

10 For nonlinear modeling, this can be extremely problematic. A parametric model that exhibits chaotic 
behavior, for instance, can with an arbitrarily small change in parameter, give rise to stable periodic behavior. 
This is sometimes referred to as the genotype /phenotype conundrum. One associates genotype with equations 
of motion, and phenotype with the long term behavior of those equations. Small perturbations in the 
genotype can give rise to huge differences in phenotype. And inferring the genotype from the phenotype is 
much more difficult than the other way around. We should remark that for linear modeling, the difficulty 
is not this extreme. In this case, if the roots of the characteristic polynomial of the AR part of the model 
are well within the unit circle, then a small perturbation of parameters will not grossly affect the overall 
behavior. (However, we might also remark that for high order polynomials, small changes in the coefficients 
can lead to large changes in the roots.) 

11 The problem of parsimony and "effective number of parameters" in nonlinear modeling is much more 
subtle in the case of nonlinear modeling; see Refs. [34, 57] for interesting discussions of this issue. 



Theiler, Linsay, and Rubin, p. 10 



and are sampled every ten seconds. We concentrate on a single series #14, chosen more 
or less arbitrarily. 12 Fig. 1(b) shows the first N = 2048 points of this time series. The 
most noticeable feature is the coming and going of an oscillation with a period of 50 time 
units (500 seconds, or about 8.3 minutes). We computed discrete Fourier transforms on 
all all seventeen data sets, using data segments of varying length and location in the time 
series, and both with and without a Hanning window. (See Fig. 2(a) for a particular case.) 
We see considerable variation, and would not be confident in attempting our own detailed 
interpretation of the power spectrum. 13 However, we do consistently see two peaks in the 
vicinity of the dominant frequency (0.002 Hz), suggesting that the signal is quasiperiodic 
and that the "coming and going" may be a beating phenomenon. The autocorrelation curve 
(Fig. 2(b)) supports this interpretation, and also indicates that the coherence time is at least 
on the order of a thousand time steps, and possibly much longer. 

In searching for nonlinear structure, any nonlinear statistic in principle is adequate. We 
used an estimator of fractal dimension, obtained from the slope of a correlation integral [17] 
at a point r equal to half of the rms amplitude of the time series. While this is not our best 
shot at what the actual dimension is (in fact, for this data, we do not really even see a hint 
of low-dimensionality), it does provide a nonlinear statistic against which we can compare 
real data to surrogate. What we see in Fig. 3(a) is that — for this statistic — the real and 
surrogate data are indistinguishable. We quantify significance by counting the number of 
"sigmas" between the original and surrogate values for the discriminating statistic, where a 
"sigma" is the standard deviation of all the values of the statistic computed for the surrogate 
data sets. 

Because the data set has a lot of what appears to be high frequency noise, we also 
considered a crude low-pass linear filter of the data, based on a moving average (equal 
coefficients) of ten sample points. That is, x' t = (x t + x t _\ + • • • + a^-g^lO. Fig. 1(e) 
shows how smoothing affects this data set, and in Fig. 3(b), we again compare real data to 
surrogates. At about the four sigma level, the difference between the real data and surrogates 
is statistically significant. Inspection of the actual values, however, reveals that the difference 
is never more than 8%; we are inclined to remark that the difference is "significant," but not 
very "substantial." When we used nonlinear forecasting error instead of estimated dimension 
as our discriminating statistic, we did not see any significant evidence for nonlinearity for 
either the smoothed or the raw data set. 

Now, if the surrogate data really is mimicking all the linear properties of the original 
time series, then any linear statistic computed from both surrogate and original data should 
give the same value. We plot one such statistic, the in-sample fit error of the best linear 
model, in Fig. 4(a,b). For both E14.dat and the smoothed E14.dat, there is a small but 
statistically significant discrepancy. So the surrogate data evidently is not mimicking "all" 
of the linear properties. The technical explanation is that the in-sample fit error is a sample 
statistic which does not depend precisely and entirely on the circular autocorrelation. That 
the discrepancy should be systematic, however, is an artifact of long coherence times, as we 

12 We were partly motivated to use this series because we knew that M. Palus (in this volume) had looked 
at the same series. 

13 We have not attempted to use the information (which was provided) which gave the absolute starting 
times for each of the seventeen time series. Combining the data into one long time series with appropriate 
gaps should permit much more precise spectral estimation. 
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Figure 1: (a,b,c) The top three time series are (b) SFI data set E.dat #14, and (a,c) two surrogate 
data sets. (d,e,f) The bottom three are (e) set E.dat #14 smoothed with a moving average window 
of size ten time steps, and (d,f) two of its surrogates. Figures (b,e) are the first N = 2048 points 
of an approximately 2600 point data set. 
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Figure 2: (a) Power spectrum, computed by a discrete Fourier transform from all N = 2602 
points in the time series E14-dat, using a Planning window, (b) Autocorrelation of the data set 
El^.dat computed with the biased estimator A(T) = X)£l T X t x t+T> w ^ere x' t = (xt — \i)jo is the 
normalized time series value. 



show in Section 3.2. 



3.2 Sine wave plus noise 

To investigate the effects of long coherence time in a situation where we know the underlying 
process, we generated artificial data with infinite coherence time by adding measurement 
noise to an underlying sine wave. We chose the period and noise level to (very crudely) 
approximate that of the smoothed E.dat. 

In general, as Fig. 5 shows, generating surrogate data by the FT algorithm leads to 
surrogates that do not have the coherent structure of the original sine wave. It is possible 
to generate good surrogates by fortuitous choice of data length. For periodic data, this is 
only a slight inconvenience (requiring the use of a general discrete Fourier transform (DFT) 
instead of the fast Fourier transform (FFT) which requires data length to be a power of 
two); for quasiperiodic data, this is trickier, because one must choose the length of the time 
series to be (at least approximately) commensurate with both periods. 

There are two effects going on here. The first involves choosing the length so that the 
periodic continuation is at least continuous (doesn't have a jump). If this is not done, one 
introduces spurious high frequencies into the data. This effect can be alleviated to some 
extent by windowing the data, e.g., with a Hanning window (see Ref. [51, §2.4.2]). The 
second effect involves choosing the length of the time series so that all the relevant periods 
are commensurate with this length. If this is not done, then the DFT takes the power from a 
single frequency and distributes it to adjacent frequency bins; upon inverting the DFT after 
randomizing the phases, one sees a beating between the adjacent frequencies instead of the 
pure frequency in the original time series. In this second case, windowing the data does not 
help. 

Using this sine wave plus noise, we see in Fig. 6 that a dimension-based test for non- 




Figure 3: Significance of evidence for nonlinearity based on an estimate of the correlation dimen- 
sion: (a) for data set E14-dat, and (b) for the smoothed data set. The top panels show estimated 
dimension for real (n) and surrogate (+) data, as a function of embedding dimension. The bottom 
panels show "number of sigmas" as an indication of the statistical significance with which one can 
reject the null hypothesis that the experimental data is linear. 

linearity is able to distinguish the real and surrogate data with high statistical confidence. 
Again, the difference is extremely "significant" but not especially "substantial." Further, as 
Fig. 4(c) shows, the data is also distinguished from the surrogates by a linear statistic. 

We argued in Section 3.1 that the evidence for nonlinearity observed in E.dat might be 
an artifact of the long coherence time. In this section, we have shown that the effects seen 
with E.dat are also seen in a data set which by construction is linear, but which has a long 
(in fact, infinite) coherence time. 

3.3 Linear versus nonlinear modeling: an example 

Another way to test for nonlinearity in a time series is to compare the linear and nonlinear 
models to see which more accurately predicts the future. For example, Casdagli [8, 9] has 
described an "exploratory" approach in which the data is fit with local linear models using 
k nearest neighbors. The parameter k is swept from m + 1, the minimum value required 
to make a local linear fit in m dimensions, up to the size N of data set itself. For k < N, 
the model is nonlinear, but for k = N it is equivalent to a globally linear model. If the 
error decreases monotonically with k, then the process is taken to be linear. If the error 
increases monotonically with k, then the process is taken to be nonlinear and deterministic. 
If, as most often happens, the error first decreases with increasing k and then increases, the 
process is taken to be nonlinear and stochastic. 

Kanter [22] has shown the unsurprising result that for an indeterministic linear Gaussian 
process, the optimum predictor is a linear predictor. 14 We consider a slightly different case 



14 What is surprising in Kanter's paper is that linear non-Gaussian processes can be more accurately 
modeled with a nonlinear predictor. We are grateful to J. Scargle for pointing out this reference to us. 
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Figure 4: A linear statistic, the in-sample rms fitting error, is computed for linear models with 
embedding dimension m. Here (a) is for El^.dat, (b) is for the smoothed data, and (c) is an 
artificially generated sine wave with measurement noise. It is apparent, particularly for cases (b) 
and (c), that the surrogate data is not as linearly predictable as the original data. 
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Figure 5: Sine signal with white measurement noise, and surrogate data sets generated by the FT 
algorithm without windowing. In (a,b,c), the length of the time series is a convenient (for FFT 
purposes) power of two, N = 2048; the original data set is in (b), while (a) and (c) are the surrogate 
data sets. In (d,e,f), we use the same time series, slightly truncated to N = 2014 points, so that 
there is a near-integral number of oscillations in the time series. Again the middle data set (e) is 
the original, while (d) and (f) are two surrogates. 



Theiler, Linsay, and Rubin, p. 16 




Figure 6: Evidence for nonlinearity in a time series composed of a sine wave plus white noise. While 
there is no indication for low- dimensionality (there is too much noise to see that the underlying 
signal is one- dimensional) , the estimated dimension is significantly different for the original data 
than for the surrogate data. 

than Kanter studied; we look at a time series that is generated by a linear process, and 
compare the linear and nonlinear models that are fit to the finite length of the time series. 
What we find for data with long coherence times is that the nonlinear models are often 
superior. 

Consider again a sine wave with additive measurement noise, but to keep the analysis 
simple, take the sampling rate to be exactly twice the frequency of the signal. The time 
series is produced by x t = s t + n t , where the signal s t = ( — l) f S is alternating in sign while 
maintaining a constant amplitude S, and the noise n t is a white noise process of amplitude 
o. That is, 

x t = (-iyS + ae t , (11) 

where the e/s are unit variance IID Gaussian random variables. The time series is linear, 
and can be produced by the ARMA model 

x t = -x t _i + ae t + crei_i, (12) 

with appropriate initial conditions. In fact, the initial conditions are crucial; notice that the 
signal amplitude S does not even appear in Eq. (12). This is a general property of processes 
with infinite coherence times. For a process with a finite coherence time, the amplitude of 
the signal is determined solely by the coefficients of the ARMA model. 15 



15 This is another disadvantage of ARMA surrogates compared to FT surrogates; the FT surrogates by 
construction will possess the same amplitude as the original time series. 
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3.3.1 Linear modeling 

Let M {m) denote the best order-m autoregressive (AR) linear predictor; here 



E ^k x t-k, 
k=l 



where the coefficients are chosen to minimize the mean squared error of the process: 



({x t - x t f) = ((x t 



E a k x t _ k f) 



k=l 



1 



£(-i) fc «* 

k=l 



s 2 + 



k=l 



With a little algebra, one can show that these coefficients are given by 

C2 

dk = (-1J J' 

and that the average squared error of this optimal predictor is given by 

a 2 S 2 



E 2 \M n 



m 



= <0 



= + 



(13) 



(14) 



(15) 



;i6) 



;it) 



mS' 2 + a 2 

Note that the error decreases monotonically with increasing m, approaching a "floor" of 
<j 2 as m — y oo. It is basically impossible to beat this error with any model, linear or nonlinear, 
because it is the noise on the signal which is by definition unpredictable. It is also worth 
remarking that the convergence of the linear AR model is algebraically slow with embedding 
dimension m. For chaotic processes (or more generally, for any stochastic processes with a 
finite coherence time), the convergence is usually faster. 

This error assumes that the "correct" model is chosen for a given order m. In practice, 
one fits a model M s to a finite sample of TV data points. The fit is optimal for the data in 
the sample set, but in general is not optimal for out-of-sample data. Particularly when m 
is large, and TV is small, the difference between the out-of-sample error for "correct" model 
and for the fit model can be significant. 

The effect can be quantified with the aid of the Akaike Information Criterion (AIC) [1], 
which provides a measure of the difference between in-sample error E s and out-of-sample 
error E Q for the best in-sample model M s . (See Tong [53, §5.4] for a modern discussion.) 
Here, 

log(E [M s (m, N)]) = \og(E s [M s (m, N)]) + m/N. (18) 

We have observed numerically that the error of the "correct" model M lies roughly half way 
between the in-sample and out-of-sample error of the in-sample fit model M s ; that is, the 
difference m/N in Eq. (18) can be split into two roughly equal components: 16 

log(E [M s (m,N)}) - \og(E s [M s (m,N)}) = 



la S. Ellner (personal communication) has provided a heuristic argument for why the terms should be equal. 
The argument notes that the difference in error between M a and M s can be expanded as a Taylor expansion 
in o — S (where represents the finite vector of parameters in model M), and that the relevant term is the 
second derivative of E a and E s , respectively, multiplied by (0 o — 8 S ) ■ (6 — 8 S ). Since one expects E s and E a 
to be asymptotically equal (as TV — ► oo), it follows that their second derivatives should also be asymptotically 
equal; thus the expected differences _E S [M Q ] — _E S [M S ] and _E Q [M S ] — _E [M ] should also approach equality 
for large TV. 
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{\og{E [M a {m,N)]) - log(E [M (m)})} 
+ {log(E s [M (m)])-log(E s [M s (m,N)})} (19) 

where 

log E [M s (m, N)] - \ogE [M (m)\ m m/2N (20) 

and 

\ogE [M (m)} - log E s [M s (m,N)} = 

log E a [M (m)] - log E s [M s (m,N)] « m/2iV. (21) 

For large A" and large m (but m <C AQ, we can combine Eq. (17) and Eq. (21) to write 
the total squared error as the sum of noise (unavoidable), model inadequacy (m too small), 
and parameter mis-specification (N too small): 

E 2 [M s (m,N)} = a 2 + ^- + ^f (22) 

Thus, for a given finite JV, there will be an optimum m for which the total error is minimized. 
In particular, for large JV, the optimum model occurs when m = -\/~N, and the total squared 
error in this case is given by cr 2 + 2a 2 /\/N. 

3.3.2 Nonlinear modeling 

By contrast, consider as an example, the following parametric nonlinear model: 

x t = -S*sgn(x t _ 1 ) (23) 

where sgn is the "signum" or "sign" function; it's value is +1 or -1, depending on the sign 
of its argument. Here m = 1, and using a learning set of size JV, one can estimate the 
parameter S* to within an error of a j\fN . (This assumes a <C S so that sgn(x t ) = sgn(s t ) 



at almost every time step.) Then, the total squared error is given by 

{(x t -x t f) = ((ae t + (S - Sygnix^)) 2 ) (24) 

= a 2 + ((S - S*) 2 ) (25) 

= <j 2 + <j 2 /N. (26) 

Though both the linear and nonlinear model converge to the same "floor" in the N — » oo 



limit, the nonlinear model converges more quickly. For a given JV, the nonlinear model (with 
m = 1) beats the best linear AR model (with any m). 

One might argue that using this parametric form for the nonlinear model is unfair, since in 
general one does not know the nature of the model that generated the time series. However, 
we remark that this model is not far from a local linear approximation that uses A^/2 nearest 
neighbors. 17 

17 If a is small, then the N/2 neighbors of a point with x t > will be a cloud of points which all have 
Xt > 0. A linear fit to this data will be of the form x t +\ = A + B(x t — S) where A « S, in particular 
A — S ~ c/V^V, and B ~ l/^/N. It follows that the reduced squared error ((S — x t ) 2 ) will scale like a 1 /N . 
By contrast the linear m = 1 model, achieves ((S — x t ) 2 ) ~ a 2 . Already, at m = 1, the local-linear model is 
better than the best global linear model, which requires an embedding dimension m = -\/N, and achieves a 
reduced squared error that scales as a 2 /y/N. 
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In the example in Fig. 7, we consider S = 1, a = 0.3, and N = 128. The ratio of signal 
to noise power is S 2 /a 2 = 11. Figure 7 shows both the theoretical error and the results of 
numerical simulations for these parameters. 




0.1 H 



1 10 100 

m 

Figure 7: In-sample and out-of-sample errors are plotted against embedding dimension, for linear 
AR models fit to N = 128 data points of a signal plus noise process defined in Eq. (11 ), with signal 
amplitude S = 1 and noise amplitude a = 0.3. (a) Circles denote in-sample errors and pluses are 
out-of-sample errors for individual trials, (b) Median (with error bars given as the standard error) 
of the errors shown in the above panel. In both (a) and (b), we have plotted three theoretical curves. 
The dotted line corresponds to the expected error E[M (m)] of the "correct" order-m model, given 
in Eq. (17). The dashed line is the theoretical in-sample error of the best-fit model E s [M s (m, N)], 
given in Eq. (21), and the dashed-dotted line is the theoretical out-of-sample error of the best-fit 
model E [M s (m, N)], given in Eq. (20). 

3.4 Nonsinusoidal periodic signals 

The difficulties associated with periodic sine signals seem to be compounded when higher 
harmonics are added. (Some of these extra difficulties were also addressed in Ref. [42].) 
Consider a time series given by x(t) = sin(i) + sin(3i) + <re t , where o = 0.05 and e t is 
uniform noise with unit range. Even for very small a , a linear model of this data requires 
four past values to predict the future, because it has to estimate the phase and amplitude 
of two sine waves — the phase and amplitude are not coded into the model itself. One finds 
that for the same embedding dimension m, nonlinear models fit this data better than linear 
models. 
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In particular, as seen in Fig. 8, Casdagli's plots of forecasting error as a function of number 
of neighbors in the local linear fit [8, 9] indicate nonlinearity in a time series, even though the 
system is formally speaking linear. Our intuitive explanation is that the nonlinear models are 
able to use information that is unavailable to the direct linear model; namely the amplitudes 
and relative phase of the two sine waves. So, while the linear model requires four degrees 
of freedom, the nonlinear model is relatively successful with only one. We should emphasize 
that the Casdagli plot was intended as an "exploratory" method of time series analysis, and 
that it appears very well suited for that purpose. The ambiguity of interpretation that arises 
when the Casdagli plot is applied to data with long coherence times is a problem that is not 
unique to the Casdagli plot, but is just another artifact of the long coherence time. 18 

And in generating surrogate data sets, one again finds that nonsinusoidal periodicity is 
even worse than sinusoidal. As well as the usual difficulties, one has the added problem 
that the FT algorithm does not preserve the phase relation between the harmonics. It is 
this phase relation that determines the shape of the periodic waveform. ARMA modeling is 
even worse, because in that case, the model encodes neither the phase relation between the 
harmonics nor the relative amplitudes of the sinusoidal components. 

3.5 Aside: Chaos and long coherence times 

Although the situation we have described so far has been restricted to linear systems with 
noise, we note that fully deterministic chaotic systems can also exhibit long coherence times. 
While this may seem at first counter-intuitive, since positive Lyapunov exponents imply a 
finite "forgetting" time, the effect has been previously noted [13, 14] in the context of the 
Rossler flow [41], and is readily apparent in maps which exhibit "banded chaos." An example 
of the latter is the logistic map, x i+ i = Xx t (l — x t ), at parameter A = 3.6. The attractor 
is chaotic, but the orbit alternately visits two bands, one above and one below the fixed 
point at x = 0.72. This underlying period two motion is coherent over the full length of the 
trajectory. 

4 Discussion 

We provide three possible interpretations of the basic source of the problems that arise when 
surrogates of highly coherent time series are generated. The first is technical; the second 
and third have more of a philosophical, almost existential flavor. 

4.1 The surrogate data generator is flawed. 

One might argue that the inability of the FT algorithm to generate surrogates which mimic 
the original data indicates a flaw in the algorithm. For coherent signals, the true power spec- 
trum contains instrument ally sharp spikes. However, when estimating the power spectrum 
from a finite time series, the spike is spread out over several distinct frequency bins with a 
very specific phase relation between them. When these phases are scrambled, and the FT is 
inverted, the resulting time series has a shorter coherence time. 

One can imagine various ad hoc solutions, such as randomizing phases only for frequencies 
not in the vicinity of the dominant frequency. We have not investigated such modifications, 
and are hesitant to do so, since they are difficult to automate in a way that would be 



8 We are tempted to say that the problem lies not in the analysis but in the data itself! 
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Figure 8: Plot in the style of Casdagli for a time series generated by adding two sine waves and a 
small measure of white noise as well as for some surrogate time series (+). Although this is 

by construction a linear time series, the plot of forecasting error versus number of neighbors in the 
local linear predictor indicates that nonlinear models are superior to linear models. Here, N = 1024 
points are taken from the time series, and the embedding dimension is (a) m = 2, and (b) m = 4, 
which is in principle adequate for a linear model of two sine waves. Note that comparison with 
surrogate data also implies that the time series is nonlinear. 
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applicable to all time series. (For example, suppose one has a quasiperiodic time series, and 
that each of the component frequencies also has higher harmonic frequencies. Distinguishing 
the peaks from the broadband then becomes a nontrivial task.) 

Technical problems arise with ARMA models as well; namely with stability, and the 
difficulty of choosing the coefficients "just right." For ARMA modeling of the sine wave, 
even if the correct coefficients are chosen, there is no way to assure that the surrogates 
will be the same amplitude as the original time series; the amplitude of linear models is 
coded not in the coefficients but in the initial conditions. One must not only model the 
coefficients of a linear model, then, one must further restrict the initial conditions which 
are iterated. (Normally, when the coherence time is finite, the amplitude is specified by 
the coefficients, though its dependence becomes increasingly sensitive as the coherence time 
increases.) Rescaling so that the amplitude of the surrogates matches that of the original 
time series does not really solve this problem, because if the signal is composed of several 
sine waves, one must also find a way to maintain all their relative amplitudes. 

An interesting possibility which we have not pursued is to model the time series not as 
an ARMA but directly in terms of its deterministic and nondeterministic components; i.e., 
Eqs. (1-3). The surrogates would be generated with new white noise realizations in Eq. (3) 
for the indeterministic component, but the deterministic (coherent) component would be 
kept the same as the original. 

4.2 The time series is nonstationary. 

If the fault is not in the algorithm, then perhaps it is in the data. While stationarity has 
a clear-cut meaning for a stochastic process, it is a fuzzier concept when applied to a time 
series. The Lorenz flow [32] is a stationary chaotic process, but if a time series is taken over 
a short enough segment, it will appear very nonstationary. For a time series, we argue that 
an useful operational definition of stationarity is that the characteristic time scales in the 
data are much shorter than than the length of the data set itself. 

If Ave think of the coherence time a one of the characteristic time scales, then highly 
coherent time series are not stationary. 

It may seem odd to characterize a sinusoidal signal as nonstationary. But one way to 
see why this is reasonable is to consider two sine waves whose frequencies are nearly equal. 
The sum of the two sine waves will exhibit a low frequency beating as they slowly move in 
and out of phase with each other. If the length of the time series is shorter than the beating 
period, then the resulting time series will appear quite nonstationary. So, if we'd like the 
sum of two stationary time series to itself be a stationary time series, we cannot permit time 
series with long coherence times to be considered stationary. 

4.3 The time series is nonlinear. 

A third interpretation of the spurious identification of nonlinearity in time series with long 
coherence times is that the nonlinearity is not spurious at all. Typical linear processes do 
not produce long coherence times, because their parameters need to be precisely adjusted. 19 

19 For example, to generate a sine wave with additive noise (see Section 3.2) requires an ARMA(2,2) model 
x t = ciiXt-i + a 2 Xt-2 + fe t + &ie t _i + 6 2 e t _ 2 , where the roots of z 2 = a,\Z + a 2 must lie precisely on the unit 
circle (|ai| < 2, a 2 = —1; or a\ = —1, a 2 = 0), and b\ and 6 2 must be precisely equal to — <ia\ and — <ra 2 , 
respectively. If the roots are outside the unit circle, then the time series will diverge to infinity exponentially; 
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We note that those clean and highly coherent sine waves that come out of signal generators 
in the laboratory do not arise from RLC circuits, but depend crucially on the nonlinearity 
of the electronics. In nature, and in the laboratory, nonlinear limit cycles are very common 
and very robust. Thus, one might argue that a long coherence time is in itself evidence for 
underlying nonlinear dynamics. 

4.4 Summary 

Although a time series may be generated by a process that is formally linear, if it has a 
long coherence time it can often fool tests for linearity, and can be mistaken for nonlinear 
time series. In particular, it is difficult to generate surrogate data which mimics the linear 
properties of the process that generated the data. 

In testing a time series for nonlinearity, it is a good idea to compare with surrogate data 
using both nonlinear and linear statistics. Good evidence for nonlinearity requires that the 
nonlinear statistics do distinguish the real and surrogate data, and that the linear statistics 
do not. Even more important is to plot an autocorrelation curve for the real data, and 
make sure that the autocorrelation A(T) vanishes as T gets large. If there is significant 
autocorrelation for T on the order of the length of the time series, then one must beware the 
dangers of long coherence times. 

Regarding E.dat, we did see some evidence for nonlinearity, but we note that that evidence 
is seen only for the smoothed data, and that it is "significant" but not "substantial." Finally, 
because E.dat has a long coherence time, we are further inclined to discount this evidence 
for nonlinearity. 
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and if the roots are precisely on the unit circle but 6; ^ —era;, the time series will diverge to infinity like a 
random walk. If the roots are inside the unit circle, then there will be a finite coherence time. 
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